Characterization of a Novel
Coronavirus Associated with Severe 
Acute Respiratory Syndrome
Silvia Peñaranda, Bettina Bankamp, Kaija Maher,
Stephan Günther, Albert D. M. E. Osterhaus,
Christian Drosten, Mark A. Pallansch, Larry J. Anderson,
William J. Bellini


In March 2003, a novel coronavirus (SARS-CoV) was discovered in association 
with cases of severe acute respiratory syndrome (SARS). The sequence of the 
complete genome of SARS-CoV was determined, and the initial characterization 
of the viral genome is presented in this report. The genome of SARS-CoV is 
29,727 nucleotides in length and has 11 open reading frames, and its genome 
organization is similar to that of other coronaviruses. Phylogenetic analyses and 
sequence comparisons showed that SARS-CoV is not closely related to any of 
the previously characterized coronaviruses. 


Several hundred cases of severe atypical pneu- 
monia of unknown etiology were reported in 
Guangdong Province of the People’s Republic of 
China beginning in late 2002. After similar cases 
were detected in patients in Hong Kong, Viet- 
nam, and Canada during February and March 
2003, the World Health Organization (WHO) 
issued a global alert for the illness, designated 
“severe acute respiratory syndrome” (SARS). In 
mid-March 2003, SARS was recognized in 
health care workers and household members who 
had cared for patients with severe respiratory 
illness in Hong Kong and Vietnam. Many of 
these cases could be traced through multiple 
chains of transmission to a health care worker 
from Guangdong Province who visited Hong 
Kong, where he was hospitalized with pneumo- 
nia and died. By late April 2003, over 4300 
SARS cases and 250 SARS-related deaths were 
reported to WHO from over 25 countries around
the world. Most of these cases occurred after
exposure to SARS patients in household or health 
care settings. The incubation period for the dis- 
ease is usually from 2 to 7 days. Infection is 
usually characterized by fever, which is followed 
a few days later by a dry nonproductive cough 
and shortness of breath. Death from progressive 
respiratory failure occurs in about 3% to nearly 
10% of cases (1-4). 

In response to this outbreak, WHO coor- 
dinated an international collaboration that in- 
cluded clinical, epidemiologic, and laborato- 
ry investigations, and initiated efforts to 
control the spread of SARS. Attempts to iden- 
tify the etiology of the SARS outbreak were 
successful during the third week of March 
2003, when laboratories in the United States, 
Canada, Germany, and Hong Kong isolated a 
novel coronavirus (SARS-CoV) from SARS 
patients. Unlike other human coronaviruses, it 
was possible to isolate SARS-CoV in Vero 
cells. Evidence of SARS-CoV infection has 
now been documented in SARS patients 
throughout the world. SARS-CoV RNA has 
frequently been detected in respiratory speci- 
mens, and convalescent-phase serum speci- 
mens from SARS patients contain antibodies 
that react with SARS-CoV. There is strong 
evidence that this new virus is etiologically 
linked to the outbreak of SARS (5-7). 


The coronaviruses (order Nidovirales, fami- 
ly Coronaviridae, genus Coronavirus) are a 
diverse group of large, enveloped, positive- 
stranded RNA viruses that cause respiratory and 
enteric diseases in humans and other animals. At 
~30,000 nucleotides (nt), their genome is the 
largest found in any of the RNA viruses. There 
are three groups of coronaviruses; groups 1 and
2 contain mammalian viruses, whereas group 3 
contains only avian viruses. Within each group, 
coronaviruses are classified into distinct species 
by host range, antigenic relationships, and 
genomic organization. Coronaviruses typically 
have narrow host ranges and are fastidious in 
cell culture. The viruses can cause severe dis- 
ease in many animals; and several viruses, 
including infectious bronchitis virus, feline in- 
fectious peritonitis virus, and transmissible gas- 
troenteritis virus, are important veterinary 
pathogens. Human coronaviruses (HCoVs) are 
found in both group 1 (HCoV-229E) and group 
2 (HCoV-OC43) and are responsible for ~30% 
of mild upper respiratory tract illnesses (8-10). 

Sequence analysis of a limited region of 
the replicase (rep) gene suggested that 
SARS-CoV was distinct from all other coronaviruses
(5-7). In this report, we compare
the sequence of the entire genome of SARS- 
CoV (Urbani strain) to the genomic sequenc- 
es of other coronaviruses. 

Genome organization. The sequence of 
the entire genome of SARS-CoV (GenBank 
accession number AY278741) was obtained 
by several approaches (11). During completion
of this manuscript, other laboratories
determined the genomic sequences of three 
additional strains of SARS-CoV. These nu- 
cleotide sequences vary at only 24 positions 
(table S3). 

The genome of SARS-CoV is a 29,727- 
nucleotide, polyadenylated RNA, and 41% of 
the residues are G or C (the range for published 
complete coronavirus genome sequences is 37 
to 42%). The genomic organization is typical of 
coronaviruses, having the characteristic gene or- 
der [5'-replicase (rep), spike (S), envelope (E), 
membrane (M), and nucleocapsid (N)-3'] and 
short untranslated regions at both termini (Fig. 
1A and table S1). The SARS-CoV rep gene, 
which comprises approximately two-thirds of 
the genome, is predicted to encode two polyproteins
(encoded by ORF 1a and ORF 1b) that undergo
cotranslational proteolytic processing.
There are four open reading frames (ORFs) 
downstream of rep that are predicted to encode 
the structural proteins S, E, M, and N, which are 
common to all known coronaviruses. The gene 
encoding hemagglutinin-esterase, which is 
present between ORF 1b and S in group 2 and 
some group 3 coronaviruses (8), was not found. 
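
The base composition quoted above (41% G or C) is a simple count over the genome sequence. The following is a minimal sketch of that calculation; the function name `gc_fraction` and the toy fragment are illustrative assumptions, not part of the original analysis.

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of G or C among unambiguous nucleotides.

    Case is ignored and symbols other than A, C, G, T, U (for example N)
    are skipped, so ambiguous calls do not distort the ratio.
    """
    counted = [base for base in seq.upper() if base in "ACGTU"]
    if not counted:
        raise ValueError("sequence contains no unambiguous nucleotides")
    return sum(base in "GC" for base in counted) / len(counted)


# Hypothetical fragment used only to exercise the function.
toy_fragment = "AAACGAACUUUGGCCAUA"
print(f"G+C content: {gc_fraction(toy_fragment):.1%}")
```

Applied to a full genome sequence loaded as a string, the same function would give the overall G+C fraction.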

Coronaviruses also encode a number of non- 
structural proteins that are located between S
and E, between M and N, or downstream of N.
These nonstructural proteins, which vary widely 
among the different coronavirus species, are of 
unknown function and are dispensable for virus 
replication (8). The genome of SARS-CoV con- 
tains ORFs for five potential nonstructural pro- 
teins that are more than 50 amino acids long in 
these intergenic regions (Fig. 1B, Table 1, and 
table S1). Two overlapping ORFs encoding pre- 
dicted proteins of 274 and 154 amino acids 
(termed X1 and X2, respectively) are located 
between S and E. Three additional potential 
nonstructural genes, X3, X4, and X5 (encoding 
proteins of 63, 122, and 84 amino acids, respec- 
tively), are located between M and N. In addi- 
tion to the five ORFs encoding the predicted 
nonstructural proteins described above, there are 
also two smaller ORFs between M and N, en- 
coding predicted proteins of less than 50 amino 
acids (Table 1). Searches of the GenBank data- 
base (with BLAST and FastA) indicated that 
there is no significant sequence similarity be- 
tween these potential nonstructural proteins of 
SARS-CoV and any other proteins (12). Note
that there are ORFs encoding predicted proteins 
more than 50 amino acids long in the structural 
genes of SARS-CoV (such as N, S, and rep). 
Many short ORFs are present in the structural 
genes. They are unlikely to be expressed and, 
for simplicity, they are not shown in Fig. 1. 
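
The 50-amino acid cutoff used above to select potential nonstructural ORFs can be illustrated with a small forward-strand ORF scan. This is only a sketch under stated assumptions: an ORF is taken to run from the first ATG to the next in-frame stop codon, only the three forward frames are scanned, and the names `find_orfs` and `longer_than` are hypothetical. It is not the procedure used to produce Table 1.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, longer_than=50):
    """Return (start, end, n_aa) tuples for forward-strand ORFs.

    Coordinates are 1-based and inclusive; `end` covers the stop codon and
    `n_aa` does not count it. Only the first ATG before each stop is used,
    and only ORFs encoding more than `longer_than` residues are kept.
    """
    seq = seq.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_aa = (i - start) // 3
                if n_aa > longer_than:
                    orfs.append((start + 1, i + 3, n_aa))
                start = None
    return sorted(orfs)


# Toy sequence (hypothetical): one ORF of six codons plus a stop codon.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(find_orfs(toy, longer_than=5))  # prints [(3, 23, 6)]
```

With the default threshold of 50, the toy sequence yields no hits, which mirrors how ORFs shorter than the cutoff are left out of Fig. 1B.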
The coronavirus rep gene products are trans- 
lated from genomic RNA, but the remaining 
viral proteins are translated from subgenomic
mRNAs that form a 3’-coterminal nested set,
each with a 5’ end derived from the genomic 5’ 
leader sequence. The coronavirus subgenomic 
mRNAs are synthesized through a discontinu- 
ous transcription process, the mechanism of 
which has not been unequivocally established 
(8, 13). The SARS-CoV leader sequence was 
mapped by comparing the sequence of 5' RACE 
(rapid amplification of cDNA ends) (11) products
synthesized from the N gene mRNA with
those synthesized from genomic RNA. A se- 
quence, AAACGAAC (genomic nucleotides 65 
to 72), was identified immediately upstream of 
the site where the N gene mRNA and genomic 
sequences diverged. This sequence was also 
present upstream of ORF 1a and immediately
upstream of five other ORFs (Fig. 1, A and B, 
and table S1), suggesting that it functions as the 
conserved core of the transcription-regulating 
sequences (TRSs). The nucleotides required for 
TRS function must be identified experimentally. 
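
Locating every copy of the conserved core sequence AAACGAAC is an exact-match search over the genome. A minimal sketch follows; the function name `find_motif` and the toy sequence are assumptions made for illustration, and the toy sequence is built so that the first copy falls at nucleotides 65 to 72, as reported for the leader.

```python
def find_motif(genome, motif="AAACGAAC"):
    """Return 1-based inclusive (start, end) coordinates of every exact match.

    Overlapping matches are reported, and U is treated as T so that RNA
    and cDNA notations give the same result.
    """
    genome = genome.upper().replace("U", "T")
    motif = motif.upper().replace("U", "T")
    hits = []
    pos = genome.find(motif)
    while pos != -1:
        hits.append((pos + 1, pos + len(motif)))
        pos = genome.find(motif, pos + 1)
    return hits


# Hypothetical sequence: 64 filler bases, the core, a spacer, a second core.
toy_genome = "G" * 64 + "AAACGAAC" + "TTT" + "AAACGAAC"
print(find_motif(toy_genome))  # prints [(65, 72), (76, 83)]
```

An exact-match search only identifies candidate sites; as the text notes, which nucleotides are actually required for TRS function has to be settled experimentally.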

The favored model for production of sub- 
genomic mRNAs of coronaviruses proposes 
that discontinuous transcription occurs during 
synthesis of the negative strand (13). Subgenomic
negative strands containing a complementary
copy of the leader sequence at their 3’
termini serve as templates for synthesis of sub- 
genomic mRNAs. In addition to the site at the 5’ 
terminus of the genome, the TRS conserved 
core sequence appears six times in the remain- 
der of the genome. The positions of the TRS in 
the genome of SARS-CoV predict that subgenomic
mRNAs of 8.3, 4.5, 3.4, 2.5, 2.0, and
1.7 kb, not including the poly(A) tail, should be 
produced (Fig. 1, A and B, and table S1). At 
least five subgenomic mRNAs were detected by 
Northern hybridization of RNA from SARS- 
CoV-infected cells, using a probe derived from 
the 3’ untranslated region (Fig. 1C). The calcu- 
lated sizes of the five predominant bands corre- 
spond to the sizes of five of the predicted 
subgenomic mRNAs of SARS-CoV; we cannot 
exclude the possibility that other, low- 
abundance mRNAs are present. Full-length 
genomic RNA was not detected, probably be- 
cause it is the least prevalent viral RNA in 
infected cells (8). The predicted 2.0-kb tran- 
script was also not detected, which suggests that 
the consensus TRS at nt 27,771 to 27,778 is not 
used or that it is a low-abundance mRNA. By 
analogy with other coronaviruses (8), the 8.3-kb 
and 1.7-kb subgenomic mRNAs are predicted to 
be monocistronic, directing translation of S and 
N, respectively, whereas multiple proteins could 
be translated from the 4.5-kb (X1, X2, and E), 
3.4-kb (M and X3), and 2.5-kb (X4 and X5) 
mRNAs. A consensus TRS is not found directly 
upstream of the ORF encoding the predicted E 
protein (14), and a monocistronic mRNA that
would be predicted to code for E could not be 
clearly identified by Northern blot analysis. It is 
possible that the 3.4-kb band contained more 
than one mRNA species that were not re- 
solved in the gel or that the monocistronic 
mRNA for E is a low-abundance message. 
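
The predicted transcript sizes follow from the TRS positions by simple arithmetic. The sketch below assumes that each subgenomic mRNA consists of the 72-nt leader, which ends with its own copy of the core at nt 65 to 72, joined to the genomic sequence that follows the last nucleotide of the body TRS core. That junction model and the function name are assumptions for illustration; only the genome length, the leader length, and the TRS at nt 27,771 to 27,778 are taken from the text.

```python
GENOME_LENGTH = 29_727  # nt, from the text
LEADER_LENGTH = 72      # nt, from the Fig. 1 legend


def subgenomic_mrna_length(body_trs_end):
    """Predicted mRNA length in nt, not including the poly(A) tail.

    `body_trs_end` is the 1-based genomic coordinate of the last
    nucleotide of the body TRS core (assumed junction point).
    """
    if not LEADER_LENGTH < body_trs_end <= GENOME_LENGTH:
        raise ValueError("body TRS must lie downstream of the leader")
    return LEADER_LENGTH + (GENOME_LENGTH - body_trs_end)


length_nt = subgenomic_mrna_length(27_778)
print(length_nt, f"{length_nt / 1000:.1f} kb")  # prints 2021 2.0 kb
```

Under these assumptions, the TRS at nt 27,771 to 27,778 gives a transcript of about 2.0 kb, consistent with the predicted 2.0-kb mRNA that was not detected on the Northern blot.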




Fig. 1. Genome organization 
and mRNA mapping of SARS- 
CoV. (A) Overall organization 
of the 29,727-nt SARS-CoV 
genomic RNA. The 72-nt leader 
sequence is represented by a 
small orange square at the 5’ 
terminus of the genome and 
the subgenomic mRNAs (be- 
low). Predicted ORFs 1a and 1b, 
encoding the nonstructural 
polyproteins, and those encoding the S, E, M, and N structural
proteins are indicated. The vertical position of the boxes indicates the phase of the reading frame. (B) Expanded view of the structural protein coding region 
and predicted mRNA transcripts. Known structural protein coding regions (blue boxes) and reading frames X1 to X5, encoding potential nonstructural proteins 
longer than 50 amino acids (gray boxes), are indicated. Lengths and map locations of the 3’-coterminal mRNAs, as predicted by identification of conserved 
transcription-regulating sequences, are indicated. (C) Northern blot analysis of SARS-CoV mRNAs. Poly(A)+ RNA was separated on a formaldehyde-agarose
gel, transferred to a nylon membrane, and hybridized with a digoxigenin-labeled riboprobe overlapping the 3’ untranslated region. Signals were visualized by 
chemiluminescence. Sizes of the SARS-CoV mRNAs were calculated by interpolation from a log-linear fit of those of the molecular mass marker. Lane 1, 
SARS-CoV mRNA; lane 2, Vero E6 cell mRNA; lane 3, molecular mass marker (sizes in kilobases). 
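
The size estimate described in the Fig. 1 legend (interpolation from a log-linear fit of the marker) can be sketched as a least-squares fit of log10(size) against migration distance. The marker sizes and distances below are hypothetical values chosen to be exactly log-linear; they are not measurements from the blot, and the function names are illustrative.

```python
import math


def fit_log_linear(distances, sizes_kb):
    """Least-squares fit of log10(size) = slope * distance + intercept."""
    n = len(distances)
    if n < 2 or n != len(sizes_kb):
        raise ValueError("need at least two matched marker points")
    logs = [math.log10(size) for size in sizes_kb]
    mean_x = sum(distances) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in distances)
    if sxx == 0:
        raise ValueError("marker distances must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distances, logs))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_size_kb(distance, slope, intercept):
    """Interpolate the size of a band from its migration distance."""
    return 10 ** (slope * distance + intercept)


# Hypothetical marker: migration distance (mm) and fragment size (kb).
marker_distance_mm = [10.0, 20.0, 30.0, 40.0]
marker_size_kb = [8.0, 4.0, 2.0, 1.0]

slope, intercept = fit_log_linear(marker_distance_mm, marker_size_kb)
print(f"band at 25 mm: {estimate_size_kb(25.0, slope, intercept):.2f} kb")
```

Real markers deviate from a straight line at the extremes of the gel, so estimates are most reliable for bands that migrate between marker fragments.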
(A) [Phylogenetic trees; image not reproduced. Recoverable labels: panels 3CLpro, POL, S, E, M, and N; taxa HCoV-229E, CCoV, FCoV, and SARS-CoV; groups G1 and G2.]

(B) Pairwise amino acid identity (percent) with SARS-CoV

Group  Virus      3CLpro   POL      HEL      S          E       M        N
G1     HCoV-229E  40.1     58.8     59.7     23.9       22.7    28.8     23.0
       PEDV       44.4     59.5     61.7     21.7       17.6    31.8     22.6
       TGEV       44.0     59.4     61.2     20.6       22.4    30.0     25.6
G2     BCoV       48.8     66.3     68.3     27.1       20.0    39.7     31.9
       MHV        49.2     66.5     67.3     26.5       21.1    39.0     33.0
G3     IBV        41.3     62.5     58.6     21.8       18.4    27.2     24.0

Predicted protein length (aa)

       Virus      3CLpro   POL      HEL      S          E       M        N
       SARS-CoV   306      932      601      1255       76      221      422
       CoV range  302-307  923-940  506-600  1173-1452  76-108  225-262  377-454


Fig. 2. Phylogenetic analysis and pairwise identities of coronavirus proteins. Predicted amino acid 
sequences of SARS-CoV proteins were compared with those from reference viruses representing 
each species in the three groups of coronaviruses for which complete genomic sequence information
was available [group 1 (G1): human coronavirus 229E (HCoV-229E), AF304460; porcine
epidemic diarrhea virus (PEDV), AF353511; transmissible gastroenteritis virus (TGEV), AJ271965.
Group 2 (G2): bovine coronavirus (BCoV), AF220295; murine hepatitis virus (MHV), AF201929.
Group 3 (G3): infectious bronchitis virus (IBV), M95169]. Sequences for representative strains of
other coronavirus species, for which partial sequence information was available, were included for
some of the structural protein comparisons [group 1: canine coronavirus (CCoV), D13096; feline
coronavirus (FCoV), AY204704; porcine respiratory coronavirus (PRCoV), Z24675. Group 2: human
coronavirus OC43 (HCoV-OC43), M76373, L14643, M93390; porcine hemagglutinating encephalomyelitis
virus (HEV), AY078417; rat coronavirus (RtCoV), AF207551]. (A) Sequence alignments
and neighbor-joining trees were generated by the use of ClustalX 1.83 with the Gonnet protein 
comparison matrix. The resulting trees were adjusted for final output with treetool 2.0.1. (B) 
Uncorrected pairwise distances were calculated from the aligned sequences with the Distances 
program from the Wisconsin Sequence Analysis Package, version 10.2 (Accelrys, Burlington, MA). 
Distances were converted to percent identity by subtracting from 100. aa, amino acid. 
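
The conversion described in the legend (percent identity equals 100 minus the uncorrected pairwise distance) can be sketched for two aligned sequences as follows. The treatment of gaps here, skipping any column with a gap in either sequence, is an assumption; the legend does not say how the Distances program handled them. The function names are illustrative.

```python
def uncorrected_distance(a, b, gap="-"):
    """Percent of compared alignment columns at which two sequences differ.

    Columns with a gap in either sequence are skipped (an assumption).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap or y == gap:
            continue
        compared += 1
        mismatches += x != y
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * mismatches / compared


def percent_identity(a, b):
    """Percent identity obtained by subtracting the distance from 100."""
    return 100.0 - uncorrected_distance(a, b)


# The conserved M protein motif and its SARS-CoV form differ at one of
# eight positions.
print(percent_identity("SWWSFNPE", "SMWSFNPE"))  # prints 87.5
```

No correction for multiple substitutions is applied, which is what "uncorrected" means here; for sequences as divergent as those in panel B, corrected distances would be larger.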


Also, in some coronaviruses, the E protein is 
translated from the second ORF on a polycistronic
mRNA (15, 16).

Phylogenetic analyses of the sequence 
of SARS-CoV. To determine the relationship 
between SARS-CoV and the previously char- 
acterized coronaviruses, we compared the pre- 
dicted amino acid sequences for three well- 
defined enzymatic proteins encoded by the rep 
gene and the four major structural proteins of 
SARS-CoV with those from representative vi- 
ruses for each of the species of coronavirus for 
which complete genomic sequence information 
was available (Fig. 2). The topologies of the 
resulting phylograms are remarkably similar 
(Fig. 2A). For each protein analyzed, the spe- 
cies formed monophyletic clusters consistent 
with the established taxonomic groups. In all 
cases, SARS-CoV sequences segregated into a 
fourth, well-resolved branch. These clusters 
were supported by bootstrap values above 90% 
[1000 replicates (17)]. Consistent with pairwise
comparisons between the previously character- 
ized coronavirus species (Fig. 2B), there was 
greater sequence conservation in the enzymatic 
proteins [3CLpro, polymerase (POL), and helicase
(HEL)] than among the structural proteins
(S, E, M, and N). These results indicate that 
SARS-CoV is not closely related to any of the 
previously characterized coronaviruses and 
forms a distinct group within the genus Coro- 
navirus. SARS-CoV is approximately equidis- 
tant from all previously characterized coronavi- 
ruses, just as the existing groups are from one 
another. Detailed pairwise comparison by dot- 
plot analysis identified many regions of amino 
acid conservation within each protein (fig. S1), 
but the overall level of similarity between 
SARS-CoV and the other coronaviruses was 
low (Fig. 2B). No evidence for recombination 
was detected when the predicted protein sequences
were analyzed with the program SimPlot
(11, 18).

Predicted replicase gene products of 
SARS-CoV. Coronaviruses encode a chymotrypsin-like
protease, 3CLpro, that is analogous
to the main picornaviral protease 3Cpro (19).
They also encode one (group 3) or two (groups
1 and 2) papain-like proteases, termed PLP1pro
and PLP2pro, which are analogous to the
foot-and-mouth disease virus leader protease Lpro.
Overall, gene products of ORF 1a are poorly
conserved among different coronaviruses, except
for these protease sequences (fig. S1). The
predicted gene product of ORF 1a of SARS-CoV
appears to contain only one PLPpro domain
at amino acids 1632 to 1847. The 3CLpro
catalytic histidine and cysteine residues are fully
conserved among all coronaviruses (SARS-CoV
amino acids His3281 and Cys3385), but
coronaviruses appear to lack the conserved catalytic
acidic residue that is characteristic of
other 3C-like proteases (19). The coronavirus
replicase polyprotein is synthesized by a -1
ribosomal frameshift at a conserved “slippery”
site (UUUAAAC) immediately upstream of a
pseudoknot structure in the overlap of ORF 1a
and ORF 1b. This polyprotein is autocatalytically
processed to yield the mature viral proteases
PLPpro and 3CLpro, the RNA-dependent
polymerase (POL), the RNA helicase (HEL),
and other proteins whose functions have not
been well characterized. The predicted ribosomal
frameshift at the SARS-CoV slippery
site (nt 13,392 to 13,398) would result in translation
of 7073 amino acids from a single start
site.
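
The length of the frameshifted product can be checked with coordinate arithmetic. In a -1 frameshift the ribosome slips back one nucleotide, so the last nucleotide of the slippery site is read twice. The sketch below takes the end of the slippery site (nt 13,398) from the text; the ORF 1a start (nt 265) and the end of the ORF 1b stop codon (nt 21,485) are assumed coordinates used for illustration, since the text refers to table S1 for ORF locations.

```python
def frameshift_product_length(orf_start, slip_end, stop_end):
    """Amino acids translated across a -1 ribosomal frameshift.

    All coordinates are 1-based and inclusive. `slip_end` is the last
    nucleotide of the slippery site; it closes the final zero-frame codon
    and is read again as the first base of the -1 frame. `stop_end` is
    the last nucleotide of the downstream stop codon.
    """
    upstream_nt = slip_end - orf_start + 1
    downstream_nt = stop_end - slip_end + 1  # slip_end counted twice
    if upstream_nt % 3 or downstream_nt % 3:
        raise ValueError("coordinates are not in frame")
    # The stop codon is included in downstream_nt but encodes no residue.
    return upstream_nt // 3 + downstream_nt // 3 - 1


# 265 and 21_485 are assumed values; 13_398 is from the text.
print(frameshift_product_length(265, 13_398, 21_485))  # prints 7073
```

With these assumed boundaries the result matches the 7073 amino acids stated above, which is a useful consistency check on the coordinates rather than an independent derivation.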

Analysis of the predicted structural 
proteins of SARS-CoV. The structural pro- 
teins of coronaviruses (S, E, M, and N) 
function during host cell entry and virion 
morphogenesis and release (20). During viri- 
on assembly, N binds to a defined packaging 
signal on viral RNA, leading to the formation 
of the helical nucleocapsid. M is localized at 
specialized intracellular membrane struc- 
tures, and interactions between the M and E 
proteins and nucleocapsids result in budding 
through the membrane. In some group 2 
coronaviruses, the C terminus of M interacts 
with the nucleocapsid to form a core structure 
(21). The S protein is incorporated into the 
viral envelope, again by interaction with M, 
and mature virions are released from smooth 
vesicles (22). Bands corresponding to the 
predicted N and S proteins of SARS-CoV 
were visible in preparations of purified virions
that were analyzed by SDS-polyacrylamide
gel electrophoresis; however, the assignment
of other proteins in virions awaits the
availability of specific antibodies to identify 
these viral proteins (fig. S4). 

The S proteins of coronaviruses are large 
type-I membrane glycoproteins that are responsible
both for binding to receptors on host cells
and for membrane fusion. The S proteins of 
some coronaviruses are cleaved into S1 and S2 
subunits. S proteins also contain important vi- 
rus-neutralizing epitopes, and amino acid 
changes in the S proteins can dramatically affect 
the virulence and in vitro host cell tropism of the 
virus (23, 24). Because of the low level of 
similarity (20 to 27% pairwise amino acid iden- 
tity) between the predicted amino acid sequence 
of the S protein of SARS-CoV and the S pro- 
teins of other coronaviruses (Fig. 2B and fig. 
S1A), the comparison of primary amino acid 
sequences does not provide insight into the re- 
ceptor-binding specificity or antigenic proper- 
ties of SARS-CoV. 

The S protein of SARS-CoV has 23 poten- 
tial N-linked glycosylation sites (table S2). 
Functional motifs at the amino (N) and carbox- 
yl (C) termini of the S protein that are con- 
served among the coronaviruses are also 
present in the predicted SARS-CoV S protein, 
although the S2 domain is more conserved than 
the S1 domain. The N terminus of the SARS- 
CoV S protein contains a short type-I signal 
sequence composed of hydrophobic amino ac- 
ids that are presumably removed during co- 
translational transport through the endoplasmic 
reticulum. The C terminus, consisting of a 
transmembrane domain and a cytoplasmic tail 
rich in cysteine residues, is highly conserved in 
SARS-CoV (Fig. 3). At 52 amino acids in 
length, the SARS-CoV S protein is predicted to 
have the shortest transmembrane domain and 
cytoplasmic tail of any coronavirus analyzed 
(Fig. 3) (range, 61 to 74 amino acids). 

The current paradigm of protein-mediated 
membrane fusion proposes the collapse of alpha-amphipathic
regions in the C half of the
coronavirus S protein into coiled coils, thus
bringing a fusion peptide toward the transmem- 
brane domain, resulting in cellular and viral 
membrane fusion. Two or three alpha-amphi- 
pathic regions are predicted for the C half of 
coronavirus S proteins. An alpha-amphipathic 
region of 116 amino acids was predicted with 
high confidence at positions 884 to 999 of the 
SARS-CoV S protein (fig. S2). Syncytia forma- 
tion, however, is not a prominent feature of 
SARS-CoV infection of Vero cells (5). The 
SARS-CoV S protein lacks the basic amino acid 
cleavage site found in group 2 and group 3 
coronaviruses (25), suggesting that the SARS- 
CoV S protein is probably not cleaved into S1 
and S2 subunits. 

Although overall sequence conservation is 
low (Fig. 2B), the predicted E, M, and N pro- 
teins of SARS-CoV contain conserved motifs 
that are found in other coronaviruses. Consistent 
with the E proteins of other coronaviruses, the 
predicted E protein of SARS-CoV contains a 
hydrophobic domain (residues 12 to 37) flanked 
by charged residues and followed by a cysteine- 
rich region. The N-terminal domains of corona- 
virus M proteins are exposed on the viral sur- 
face, whereas the C terminus is inside the viral 
membrane. Most coronavirus M proteins, in- 
cluding the predicted M protein of SARS-CoV, 
contain three hydrophobic transmembrane do- 
mains in the N-terminal half of the protein, 
although some viruses have four. A highly con- 
served amino acid sequence [SwWSFNPE 
(26)], immediately following the third hydro- 
phobic domain, is SMWSFNPE in the SARS- 
CoV M protein. The M proteins of coronavi- 
ruses are invariably glycosylated near the N 
terminus. Group 1 and group 3 coronaviruses 
are N-glycosylated, whereas those of group 2 
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Fig. 3. Conserved motifs in coronavirus S proteins. Alignment of 
the C-terminal region of the SARS-CoV and reference corona- 
virus S proteins was generated with ClustalX 1.83. Residues that 
match the SARS-CoV sequence exactly are boxed. The mem- 
brane-spanning domain and cytoplasmic tails are delineated 
with arrows. The amino acid sequence Y(V/I)KWPW(Y/W)VWL 
(26) is a conserved motif in all three coronavirus groups. The 
cysteine-rich region, which overlaps the membrane-spanning 
region and the cytoplasmic region, is also found in all corona- 
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Table 1. Classification of ORFs encoding potential nonstructural proteins of SARS-CoV. [The table shows 
the differences in nomenclature used to describe ORFs encoding potential nonstructural proteins of 
SARS-CoV in this report and in the report by Marra et al. (30). These differences are in nomenclature only, 
and the seven nt sequence differences between these strains do not change the position or number of 
ORFs (table $2). Because the complete SARS-CoV sequences have been available for only a few weeks and 
will probably be analyzed in great detail in the upcoming months, any nomenclature proposed at this time 
should be considered preliminary. The nomenclature used for the nonstructural proteins X1 to X5 is 
expected to be clarified once experiments on the transcriptional expression of the SARS-CoV genome are 


reported.] 
Genome location (nt)* Protein (number of This repork? Marra et al. 
amino acids) P (30) 
25,268 to 26,089 274 X11 ORF3 
25,689 to 26,150 154 X2 ORF4 
27,074 to 27,262 63 X3 ORF7 
27,273 to 27,638 122 X4 ORF8 
27,638 to 27,769 44 <50 amino acids ORF9 
27,779 to 27,895 39 <50 amino acids ORF10 
27,864 to 28,115 84 X5 ORF11 
28,130 to 28,423§ 98 See text ORF 13 
28,583 to 28,792§ 70 See text ORF 14 


*Based on the sequence of the Urbani strain of SARS-CoV (GenBank accession no. AY278741.1). 


tin this report, the 


ORFs encoding the predicted nonstructural proteins are designated as X1 to X5 and are numbered sequentially beginning 
at the 5’ terminus of the genome. Only ORFs encoding for predicted proteins longer than 50 amino acids are included 
in Fig. 1B. The locations and sizes of the ORFs encoding the predicted replicase protein, structural proteins, and 


nonstructural proteins are shown in table S2. 


{In Marra et al. (30), all of the ORFs, including those encoding the 


predicted replicase protein and structural proteins, are numbered sequentially from the 5’ terminus of the genome. This 


table shows only ORFs encoding predicted nonstructural proteins. 


protein. 


viruses are O-glycosylated (27, 28). The predict- 
ed M protein of SARS-CoV has an NGT near its 
N terminus, suggesting that this protein is N- 
glycosylated at position 4. 
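
Potential N-linked glycosylation sites such as the NGT noted above are conventionally identified by the sequon N-X-S/T, where X is any residue except proline. The following is a minimal sketch of such a scan; the sequon rule is the standard convention rather than something stated in this report, and the 12-residue fragment is an assumed N-terminal stretch of the M protein used only for illustration.

```python
import re

# Zero-width lookahead so that overlapping sequons are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def n_glycosylation_sites(protein):
    """Return 1-based positions of Asn residues in N-X-S/T sequons (X != P)."""
    return [match.start() + 1 for match in SEQUON.finditer(protein.upper())]


# Assumed N-terminal fragment, for illustration only.
m_protein_fragment = "MADNGTITVEEL"
print(n_glycosylation_sites(m_protein_fragment))  # prints [4]
```

A sequon marks only a potential site; whether a given asparagine is actually glycosylated depends on its accessibility in the folded protein and has to be shown experimentally.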

The predicted N protein of SARS-CoV is 
a highly charged basic protein of 422 amino 
acids (range for other coronaviruses, 377 to 
454) with seven successive hydrophobic res- 
idues near the middle of the protein. Al- 
though the overall amino acid sequence ho- 
mology among coronavirus N proteins is low 
(Fig. 2B), a highly conserved motif [FYYLGTGP
(26)] occurs in the N-terminal half of
all coronavirus N proteins, including that of 
SARS-CoV. Other conserved residues occur 
near this highly conserved motif (fig. S3). 

Conclusion. The completion of the 
genomic sequence of SARS-CoV provides a 
first look at the molecular characteristics of this 
virus and clearly demonstrates that this virus 
has features typical of a coronavirus, while it 
also has features that distinguish it from all 
previously sequenced coronaviruses. Relative
to other coronaviruses, no major genomic
rearrangements and no large insertions or
deletions in the genes coding for the replicase,
S, E, M, or N proteins were found. Like some
other coronaviruses, SARS-CoV has several
small nonstructural ORFs that
are found between the genes for S and E and 
between the genes for M and N. SARS-CoV is 
a novel virus that is phylogenetically distinct 
from other characterized coronaviruses. The ge- 
netic distance between SARS-CoV and any 
other coronavirus in all gene regions implies 
that no large part of the SARS-CoV genome 
was derived from other known viruses. The 
SARS-CoV genomic sequence does not provide
obvious clues concerning the potential animal
origins of this pathogen.

The genome of SARS-CoV has several 
unique features that could be of biological 
significance. The short anchor of the S 
protein, the specific number and location of 
small ORFs, and the presence of only one 
copy of PLPpro provide a combination
of genetic features that readily differentiate 
this virus from previously described coro- 
naviruses. Of course, the significance of 
any of these features remains to be deter- 
mined experimentally. 

Successful control of the global SARS epi- 
demic will require the development of vaccines 
and antiviral compounds that effectively prevent 
or treat this disease, as well as rapid and sensi- 
tive diagnostic tests to monitor its spread. The 
availability of complete genomic sequences (table
S3) (29) of SARS-CoV within just a few weeks
of the discovery of the virus should have an
immediate impact on disease control efforts by 
making it possible to develop improved diag- 
nostic tests, vaccines, and antiviral agents. The 
sequence information will also make it possible 
to identify the origin and natural reservoir of this 
virus and to contribute to studies of the immune 
response to this virus and the pathogenesis of 
SARS-CoV-related disease. The stage is set for
the international scientific community to re- 
spond and to rapidly develop the tools to control 
this emerging infectious disease.

We sequenced the 29,751-base genome of the severe acute respiratory syndrome
(SARS)-associated coronavirus known as the Tor2 isolate. The genome sequence
reveals that this coronavirus is only moderately related to other known corona- 
viruses, including two human coronaviruses, HCoV-OC43 and HCoV-229E. Phy- 
logenetic analysis of the predicted viral proteins indicates that the virus does not 
closely resemble any of the three previously known groups of coronaviruses. The 
genome sequence will aid in the diagnosis of SARS virus infection in humans and 
potential animal hosts (using polymerase chain reaction and immunological tests), 
in the development of antivirals (including neutralizing antibodies), and in the
identification of putative epitopes for vaccine development.


An outbreak of atypical pneumonia, referred 
to as severe acute respiratory syndrome 
(SARS) and first identified in Guangdong 
Province, China, has spread to several coun- 
tries. The severity of this disease is such that 
the mortality rate appears to be ~3 to 6%, 
although a recent report suggests this rate can
be as high as 43 to 55% in people older than 60
years (1). A number of laboratories worldwide
have undertaken the identification of the caus- 
ative agent (2, 3). The National Microbiology 
Laboratory in Canada obtained the Tor2 isolate 
from a patient in Toronto and succeeded in 
growing a coronavirus-like agent in African 
green monkey kidney (Vero E6) cells. This 
coronavirus was named publicly by the World 
Health Organization and member laboratories 
as the “SARS virus” (WHO press release, 16 
April 2003) after tests of causation according to 
Koch’s postulates, including monkey inocula- 
tion (4). This virus, which we refer to as SARS- 
HCoV, was purified, and its RNA genome was 
extracted and sent to the British Columbia Cen- 
tre for Disease Control in Vancouver for ge- 
nome sequencing by the BCCA Genome Sci- 
ences Centre. 
The coronaviruses are members of a family
of enveloped viruses that replicate in the cyto- 
plasm of animal host cells (5). They are distin- 
guished by the presence of a single-stranded 
plus-sense RNA genome about 30 kb in length 
that has a 5’ cap structure and 3’ polyadenyla- 
tion tract. Upon infection of an appropriate host 
cell, the 5’-most open reading frame (ORF) of 
the viral genome is translated into a large 
polyprotein that is cleaved by viral-encoded pro- 
teases to release several nonstructural pro- 
teins, including an RNA-dependent RNA poly- 
merase (Rep) and an adenosine triphosphatase 
(ATPase) helicase (Hel). These proteins, in turn, 
are responsible for replicating the viral genome 
as well as generating nested transcripts that are 
used in the synthesis of the viral proteins. The 
mechanism by which these subgenomic 
mRNAs are made is not fully understood. How- 
ever, recent evidence indicates that transcrip- 
tion-regulating sequences (TRSs) at the 5’ end 
of each gene represent signals that regulate the 
discontinuous transcription of subgenomic 
mRNAs. The TRSs include a partially con- 
served core sequence (CS) that in some corona- 
viruses is 5'-CUAAAC-3’. Two major models 
have been proposed to explain the discontinuous 
transcription in coronaviruses and arterioviruses 
(6, 7). The discovery of transcriptionally active, 
subgenomic-size minus strands containing the 
antileader sequence and of transcription inter- 
mediates active in the synthesis of mRNAs 
(8-11) favors the model of discontinuous transcription
during the minus strand synthesis (7).

The viral membrane proteins, including the 
major proteins S (Spike) and M (membrane), are 
inserted into the endoplasmic reticulum (ER)-Golgi
intermediate compartment while full-length
replicated RNA plus strands assemble
with the N (nucleocapsid) protein. This RNA- 
protein complex then associates with the M 
protein embedded in the membranes of the ER, 
and virus particles form as the nucleocapsid 
complex buds into the lumen of the ER. The 
virus then migrates through the Golgi complex 
and eventually exits the cell, likely by exocyto- 
sis (5). The site of viral attachment to the host 
cell resides within the S protein. 

The coronaviruses include a large number 
of viruses that infect different animal species. 
The predominant diseases associated with 
these viruses are respiratory and enteric in- 
fections, although hepatic and neurological 
diseases also occur. Human coronaviruses 
identified in the 1960s (including the proto- 
type viruses HCoV-OC43 and HCoV-229E) 
are responsible for up to 30% of respiratory